Compressed Text Indexing with Wildcards

Authors

  • Wing-Kai Hon
  • Tsung-Han Ku
  • Rahul Shah
  • Sharma V. Thankachan
  • Jeffrey Scott Vitter
Abstract

Let $T = T_1 \phi^{k_1} T_2 \phi^{k_2} \cdots \phi^{k_d} T_{d+1}$ be a text of total length $n$, where the characters of each $T_i$ are chosen from an alphabet $\Sigma$ of size $\sigma$, and $\phi$ denotes a wildcard symbol. The text indexing with wildcards problem is to index $T$ so that, given a query pattern $P$, we can locate the occurrences of $P$ in $T$ efficiently. This problem has been applied to indexing genomic sequences that contain single-nucleotide polymorphisms (SNPs), because SNPs can be modeled as wildcards. Recently, Tam et al. (2009) and Thachuk (2011) proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only $nH_h + o(n \log \sigma) + O(d \log n)$ bits of space, where $H_h$ is the $h$th-order empirical entropy ($h = o(\log_\sigma n)$) of $T$.
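To make the problem statement concrete, the following is a minimal sketch of what counts as an occurrence when the text contains wildcards. It is a naive scan over the raw text, not the compressed index the paper presents, and it uses no index at all. The name `naive_wildcard_occurrences` and the use of `?` as a stand-in for the wildcard symbol are assumptions made for illustration.

```python
WILDCARD = "?"  # hypothetical stand-in for the wildcard symbol in the text


def naive_wildcard_occurrences(text: str, pattern: str) -> list[int]:
    """Return every 0-based start position where `pattern` matches `text`.

    A wildcard in `text` matches any single pattern character; every other
    text character must equal the aligned pattern character exactly.
    This is a brute-force O(n * m) baseline, not an index.
    """
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        window = text[start:start + m]
        if all(t == WILDCARD or t == p for t, p in zip(window, pattern)):
            hits.append(start)
    return hits


if __name__ == "__main__":
    # A short DNA-like text with one wildcard (for example, a SNP position).
    print(naive_wildcard_occurrences("AC?GT", "CAG"))
```

Any index for this problem has to report the same positions as this baseline; the point of a succinct or compressed index is to do so without scanning the whole text and while storing it in little space.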


Related resources

Succincter Text Indexing with Wildcards

We study the problem of indexing text with wildcard positions, motivated by the challenge of aligning sequencing data to large genomes that contain millions of single nucleotide polymorphisms (SNPs)—positions known to differ between individuals. SNPs modeled as wildcards can lead to more informed and biologically relevant alignments. We improve the space complexity of previous approaches by giv...


Indexing Compressed Text

As a result of the rapid growth in the volume of electronic data, text compression and indexing techniques are receiving more and more attention. These two issues are usually treated as independent problems, but approaches that combine them have recently attracted the attention of researchers. In this thesis, we review and test some of the more effective and some of the more theoretically inter...


Self-Indexing XML

Self-indexing is a technology that integrates text compression and text indexing, such that a text collection can be simultaneously compressed and indexed. The resulting representation, called a self-index of the text, takes space close to that of the compressed text, is able to reproduce any text substring, and offers indexed searching of the collection. This has been a major breakthrough in t...


Compression-Domain Text Indexing and Retrieval

Keyword-based text retrieval engines have been and will continue to be essential to text-based information access systems because they serve as the basic building blocks of high-level text analysis systems. Traditionally, text compression and text retrieval are treated as independent problems. Text files are compressed and indexed separately. To answer a keyword-based query, text files are first uncom...


ALLSAT compressed with wildcards. Part 4: An invitation for C-programmers

The model set of a general Boolean function in CNF is calculated in a compressed format, using wildcards. This novel method can be explained in very visual ways. Preliminary comparison with existing methods (BDDs and ESOPs) looks promising, but our algorithm begs for a C encoding, which would make it comparable in more systematic ways.



Journal:
  • J. Discrete Algorithms

Volume 19

Publication date: 2011